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Abstract 

Logistic regression is a natural and simple tool to understand how covariates 
contribute to explain the topology of a binary network. Once the model fitted, the 
practitioner is interested in the goodness-of-fit of the regression in order to check if 
the covariates are sufficient to explain the whole topology of the network and, if they 
are not, to analyze the residual structure. To address this problem, we introduce a 
generic model that combines logistic regression with a network-oriented residual term. 
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This residual term takes the form of the graphon function of a TT-graph. Using a 
variational Bayes framework, we infer the residual graphon by averaging over a series 
of blockwise constant functions. This approach allows us to define a generic goodness- 
of-fit criterion, which corresponds to the posterior probability for the residual graphon 
to be constant. Experiments on toy data are carried out to assess the accuracy of the 
procedure. Several networks from social sciences and ecology are studied to illustrate 
the proposed methodology. 

Keywords: Random graphs; logistic regression; hU-graph model; variational approximations 


2 



1 Introduction 


Networks are now used in many scientific fields (Snijders and Nowicki, 1997; Watts and 


Strogatz, 1998 Nowicki and Snijders, 2001 Hoff et al., 2002| Handcock et al., 2007 Zanghi 


et al., 2008) from biology (Albert and Barabasi, 2002 Newman, 2003; Barabasi and 01tvai| 


2004 Lacroix et al., 2006) to historical sciences (Villa et ah, 2008 Jernite et ah, 2014) and 


geography (Ducruet, 2013). Indeed, while being simple data structures, they are yet capable 
of describing complex interactions between entities of a system. A lot of effort has been 
put, especially in social sciences, in developing methods to characterize the heterogeneity 
of these networks using latent variables, covariates, or both (|Hoff et ah 2002 Handcock 


et al. 

2007 

Mariadassou et al. 

2010 

Zanghi et al. 

2010) 


In this paper, we are interested in the contribution of covariates to explain the topology 
of an observed network. To this aim, we consider standard logistic models which are a 
simple way to account for the possible effect of covariates, assuming edges to be independent 
conditionally on the covariates. Our goal is to provide the practitioners with tools to check 
the £t of the model and/or to analyze the residual structure. This goes along with the 
characterization of some residual structure present in the network that is not explained by 
the covariates. Our approach consists in combining logistic regression with the graphon 
function of a W-graph. This additional term plays the role of a very flexible, network- 
oriented residual term that can be visualized and on which a goodness-of-£t criterion can 
be based. 


The IV-graph can be casted among the latent-variable network models (Goldenberg 


et al., 2010| Matias and Robin, 2014). It is characterized by a function W called graphon 


where W (m, v) is the probability for two nodes, with latent coordinates u and n, each 


sampled from an uniform distribution over [0,1], to connect. As shown in Lovasz and 


Szegedy (2006), it is the limiting adjacency matrix of the network. This result comes from 


graph limit theory for which Diaconis and Janson (2008) gave a proper dehnition using 
Aldous-Hoover theorem, which is an extension of de Finetti’s theorem to exchangeable 
arrays. Until recently, few inference techniques had been proposed to infer the graphon 


function of a network. The earliest reference is Kallenberg (1999). Since then, both para¬ 


metric (Hoff, 2008 Palla et ah, 2010) and non parametric (Chatterjee, 2015) techniques 
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have been developed. Graphon inference is a particularly challenging problem which has 
received strong attention is the last few years (Chatterjee, 2015 Airoldi et al.| 2013 Wolfe 


and Olhede, 2013 Asta and Shalizi, 2014 Chan and Airoldi, 2014 Yang et ah, 2014). In 


the present paper, we follow Latouche and Robin (|2015) who took advantage of the fact 


that the well-known stochastic block model (SBM: Holland et ah, 1983 Wang and Wong 


1987 Nowicki and Snijders, 2001) is a special case of W-graph corresponding to a blockwise 


constant graphon. This enables them to derive a variational Bayes EM (VBEM) procedure 
to estimate the graphon function as an average of SBM models with increasing number of 
blocks. 


As mentioned above, the model we consider combines a logistic regression term with a resid¬ 


ual graphon function. Following the Bayesian framework of Latouche and Robin (2015), 


we estimate the residual graphon by averaging over a series of SBM including the one-block 
SBM, which corresponds to a constant residual graphon. We interpret a constant residual 
graphon as an absence of residual structure in the network. This approaches enables us 

(a) to assess the goodness-of-ht of the logistic regression through the posterior probability 
for the residual graphon to be constant and 

{b) to display an estimate of the residual graphon that allows a visual inspection of the 
residual structure. 


As the exact Bayesian inference of this new model for networks is not tractable, we make 
an intensive use of variational Bayes approximations to achieve the inference. Because of 
the combination of logistic regression and SBM, two different types of variational approxi¬ 
mations are actually required. 


In Section we introduce the general model and we dehne the goodness-of-ht crite¬ 
rion. Technical issues and theoretical aspects are addressed in Section Finally, toy 
and real data sets are analyzed in Section and respectively to illustrate the proposed 
methodology. In the body of the article, only undirected networks are considered. The 
extension to directed networks (with proofs and update formulas) is derived in the supple¬ 
mentary materials. The proposed methodology is implemented in the R package GOFNet- 
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work (github.com/platouche/gofNetwork), which will be available on the Comprehensive 
R Archive Network (CRAN). 


2 Assessing goodness-of-fit 

We consider a set of n individuals among which interactions are observed. The observed 
interaction network is encoded in the binary adjacency matrix Y = where Yij is 

1 if nodes i and j are connected, and 0 otherwise. We further assume that a d-dimensional 
vector, d > 1, of covariates is available for each pair of nodes. In the following, we 
denote as X = (a;p)i<ij<n the set of all covariates. 


2.1 Logistic regression and residual structure 

The influence of the covariates on the network topology can be easily accounted for using 
a logistic regression model. Such a model assumes that the edges {Yij) are independent 
(conditionally on the covariates) with respective distribution 

Hq : Yij ~ B [g{xljl3 + a)] , 

where f3 G M'^, a eM., g stands for the logistic function g{t) = 1/(1 + exp(—t)), t G M. Our 
goal is to check if model Hq is sufficient to explain the whole topology of the network. Note 
that the network structure does not explicitly appear in this model, as edges are considered 
as independent outcomes of a (generalized) linear model. 


To assess the £t of Model Hq, we dehne a generic alternative network model. The 
alternative we consider is inspired from the W-graph model. More precisely, we consider 
the model 

H,: Yijr^B[g{xl(3 + m,U,))], 


where the (lA)i<i<n are independent unobserved latent variables, with uniform distribution 
over the (0,1) interval. The non-constant function 0 : (0,1)^ h->■ M encodes a residual 
structure in the network, that is not accounted for by Model Ho. Note that, in absence of 


covariate, this model corresponds to a W-graph (Lovasz and Szegedy, 2006) with graphon 
function g ocj). Model Hq corresponds to the case where the residual function 0 is constant. 
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The present paper focuses on the goodness-of-ht of a regression model, therefore, the 
interpretation of the residual term Uj) is not critical but its visual inspection may help 
to better understand where the residual heterogeneity does come from. Note this generic 
form encompasses additive node effect, which, in absence of regression term, would result 


in a model close to the expected degree model (Chung and Lu, 2002). 


The inference of the function cj) in Model Hi is not an easy task and, following Airoldi 


et al. (2013) and Latouche and Robin (2015), we consider a class of blockwise constant 0 


function. More precisely, we dehne the Model 


M 


K 


Yi^^B[g{xl^ + ZjaZ^)], 


( 1 ) 


where a is & K x K real matrix {K > 1) and where the (^i)i<j<n are independent vectors 
with K coordinates, all zero except one. We denote (1 < < /T) the probability that the 

fcth coordinate is non-zero. Briefly speaking, each vector Zi has multinomial distribution 
Al(l,7r) where n = {7ik)i<k<K- The set of parameters of such a model is 9 = {(3,7T,a). 
Note that in the absence of covariate, this model corresponds exactly to a SBM model. The 
ability of the stochastic block model to approximate the W-graph model is demonstrated in 


Airoldi et al. (2013) and Latouche and Robin (2015) and is not the purpose of this article 


Model Hq is then equivalent to Model Mi so the goodness-of-£t problem can be rephrased 
as the comparison between Model Hq and H[, where 


Hq = Mi 


and 


H[=[j Mk. 


K>2 


2.2 Bayesian model comparison 

Now, we are given a series of Models Mx {K > 1) indexed by K which characterize Hq and 
H[. In this paper, we propose to compare Hq and H[ using a Bayesian model comparison 
framework. 

Thus, each Model Mx is associated to a prior probability p{Mx)- The parameter 9 is 
then drawn conditionally on Mx according to the prior distribution p{9\Mx)- Given 9, Mx 
and the given set X of covariates, the graph is hnally assumed to be sampled according to 
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Model 0 . In this framework the prior probability of Models Hq and H[ 
p{Ho)=p{Mi) and p{H[) = ^ p{Mk). 


are 


K>2 


Moreover, the posterior probability of Model is 

p{Y\Mk)p{Mk) p{Y\Mk)p{Mk) 


p{Mk\Y) = 


p{Y) 


j:^,^^p{Y\MK>)p{MK^y 


( 2 ) 


The goodness of £t of Model Hq can then be assessed by compnting the posterior 
probability of Hq: 

p{Hq\Y)=p{M,\Y). (3) 

The Bayes factor ( Kass and Rafter^ 1995) between Models Hq and H[ can be computed 
in a similar way as 

- where p{Y\Hy = 

K>2 


Bqi — 


where p{Y\H[) =-^^Y,P^^k)p{Y\Mk). 


(4) 


3 Inference 


The goodness-of-£t criteria introduced in the previous section all depend on marginal like¬ 
lihood terms p{Y\Mk) which have to be estimated from the data in practice. This is the 
object of this section. The prior distributions p{Mk) and p{9\Mk) are hrst introduced. A 
variational three steps optimization scheme, based on global and local variational methods, 
is then derived for inference. 

In the following, we focus on undirected networks and therefore both the adjacency 
matrix Y and the matrix X of covariates are symmetric: Yij = Yji and Xij = Xji^\/i y j. 
Moreover, we do not consider self-loops, i.e. the connection of a node to itself and there¬ 
fore the pairs are discarded from the sums and products involved. The complete 

derivation of the model and the inference procedure in the directed case are given as sup¬ 
plementary materials. The Appendix with all proofs in the undirected case is also provided 
as supplementary materials. 


7 








3.1 Prior distributions 


With no prior information on which model should be preferred, we give equal weights 
p{Hq) = p{H[) = 1/2 to Hq and H[. Therefore, p{Mi) = 1/2. Alternative choices can be 
made by integrating expert knowledge at hand. Recall that p{H[) = J2 k>2P(^k)- 

For Model Mk, the prior distribution over the model parameters in 6 is dehned as a 
product of conjugate prior distributions over the different sets of parameters: p{9\Mk) = 
p(/ 3 |Mi^)p( 7 r|M^)p(a|Mi^). Since vr is involved in a multinomial distribution to sample the 
vectors Zj, a Dirichlet prior distribution is chosen 


p{7i\Mk) = Dir( 7 r;e), 


where e is a vector with K components such that = cq > 0, V/c G {1,..., K}. Note that 
hxing Co = 1/2 induces a non-informative Jeffreys prior distribution which is known to be 


proper (Jeffreys, 1946). It is also possible to obtain a uniform distribution over the K — 1 
dimensional simplex by setting cq = 1. 

In order to characterize the d-dimensional regression vector a Gaussian distribution 
is considered 

pW\v, Mr) 0, ^ ’ 


with Id the d X d identity matrix and rj > 0 a parameter controlling the inverse variance. 
Similarly, the matrix a is modeled using a product of Gaussian distributions with 7 > 0 
controlling the variance 

^ / 1 

p{a\'y, Mr) = n-A/" f akf, 0, - 

k<i A 7 

Since we focus on undirected networks, a has to be symmetric and therefore the product 

involves the k < I terms of a. In the directed case (see supplementary materials), the 
product is over all terms k, I and the vec operator, which stacks the columns of a matrix 
into a vector, is used to simplify the calculations. 

Finally, Gamma distributions are considered for 7 



and p 


p{pi\Mr) = Gam( 7 ; oq, &o), ao, &o > 0, 


p{r]\MR) = Gam( 7 ; cq, do), Cq, do > 0 . 


8 





By construction, Gamma distributions are informative. In order to limit the influence 
on the posterior distributions, the hyperparameters controlling the scale (oq, cq) and rate 
{bo, do) are usually set to low values in the literature. 

The choice of modeling the prior information on the parameters a and fd from such 
Gaussian-Gamma distributions has been widely considered both in standard Bayesian 


linear regression and Bayesian logistic regression (see for instance Bishop and Svensen 


2003 Bishop, 2006). The prior distributions p{j3\MK) and p{a\MK) are then obtained by 
marginalizing over p{ri\MK) and p{'y\MK) respectively. This results in prior distributions 


from the class of generalized hyperbolic distributions. For more details, we refer to Garon 


and Doucet (2008). 


In the following, and in order to simplify the notations, the dependency on Mx is 
omitted in the prior and posterior distributions. 


3.2 Variational approximations 

Denoting Z the set of all latent vectors {Zi)^ the marginal log-likelihood of Model Mk, also 
called the integrated observed data log-likelihood, is given by 

logp(y|Mi^) = log|^ j p(Y\Z,a, /3)p{Z\7i)p{a\-f)p{/3\p)p{7i)p{j)p{p)d7idad/3d'jdri^ . 

(5) 

It requires a marginalization over the prior distributions of all parameters. In particular, 
it involves testing all the conhgurations of Z. Unfortunately, ([^ is not tractable and 
therefore we propose to rely on variational approximations for inference purposes. Let us 
hrst consider the global variational decomposition 


logp(y|Mx) = £^(9) +KL(g(.)||p(.|r,MK)) 


( 6 ) 


Maximizing the functional Ck{-), which is a lower bound of logp(U|M;^), with respect to 
the distribution g(-), is equivalent to minimizing the Kullback-Leibler divergence between 
q{-) and the unknown posterior distribution p{-\Y). Ck{-) is given by 

p(y,Z,7r,a,/3,7,r7) 


g(Z,7r,a,/3,7,p) log ^ 


g(Z,7r,a,/3,7,7) 


-d7rdad/3d7d7. 
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In order to maximize the lower bound, we assume that the distribution can be factorized 
as follows: 

n 

q{Z, TT, a, /3, 7 , T]) = q{7r)q{a)q{^)q{-f)q{r]) JJ q{Zi). 


2=1 


Unfortunately, Ck{-) is still intractable due to the logistic function in p{Y\Z,a, (3). Fol¬ 


lowing the work of Jaakkola and Jordan (2000), a tractable lower bound is derived. 


Proposition 1 Given any n x n positive real matrix ^ = {iij)i<iy<n, o lower bound of the 
first lower hound is given by 


\ogp{Y\MK) > CK{q) > 0 , 


where 


r ( c\ f (7 n V Op{Z, tt, a, (3, 7, v) . ^ i 

^K{q] 0 = 2^ I ?(^,7r,a,/f,7,7)log-—---r-d7rdad/3d7d?7. 


and 


log h{Z, a, fi,0 = Yl ~ 2 '^iZjaZj + xhfi) + \og 

with fij G fij = fji- Moreover, X{fij) = {g{fij) — 1/2) /{2^ij), g being the logistic 
function. 

The proof is given in Appendix A.l. The quality of the lower bound Cxig] 0^ which was 
obtained through a series of Taylor expansions, clearly depends on the choice of the matrix 


As we shall see in Section 3.2.2 ^ can be estimated from the data to obtain tight bounds. 


3.2.1 Variational Bayes EM 

For now, we assume that the matrix ^ is hxed and we rely on Cxiq', 0 ^ lower bound 


of log p{Y\Mk). In order to maximize the lower bound, a VBEM algorithm (Beal and 


Ghahramani, 2002) is applied on Cxig', 0- This optimization scheme is iterative and is 


related to the EM algorithm (Dempster et ah, 1977). Keeping all distributions hxed except 


one, the bound is maximized with respect to the remaining distribution. This procedure 
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is repeated in turn until convergence of the bound. The optimization of the distribution 
q{Z) over the latent variables usually refers to the variational E step. The updates of q{Ti), 
q{a), q{(3), ^( 7 ), and q{r]) refer here to the variational M step. Propositionprovides the 
update formula of the E-step and Propositions to provide these of the M-step. The 
corresponding proofs are given in Appendix A.2 to A.7. 

Proposition 2 The variational E update step for each distribution q{Zi) is given by: 

q{Zi) = M(Zi; l,Ti), 

where 'Yl!k=i Ek = 1 and 

{ K n ^ K n 

{{Yij ~ 2'^ ~ Y 

l=l l=l 

K 

l=l 

ipl') denotes the digamma function which is the logarithmic derivative of the gamma func¬ 
tion. 


Proposition 3 The variational M update step for the distribution q{7r) is given by: 

q{Ti) = Dir{7i] e"'), 

where, V/c G {1,..., K}, = Cq + Y^=i Tk, Tk being given by Proposition^ 

Proposition 4 The variational M update step for the distribution q{(3) is given by: 

q{(3) = A/'(/?; mg,Sy), 


where 


and 


Sy ^ = ^Id + Y 
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Proposition 5 The variational M update step for the distribution q{'y) is given by: 


q{j) = Gam( 7 ; an,bn), 

where On = ao + and = &o + | Y.k<i ■ 

Proposition 6 The variational M update step for the distribution q{r]) is given by: 

q{rj) = Gam(7; Cn,rfn), 

where = cq + | and dn = do + ^Tr(S'/ 3 ) + Sy and my being given by Proposition 

0 

Proposition 7 The variational M update step for the distribution q{a) is given by: 

K 

*?(«) = II-^ {ma)kh (o'a)fcO , 

k^l 


where 


O’, 


^2\—1 _ 

(y./kk r 


^ ^ jki Vfc, 


^^3 


n 

i^Dki = ^ + 2 X{^ij)rikTji,Wk ^ I, 

{mo)kk = {(^Dkk^ \ TikTjk.^k, 

{ma)ki = {(rDki ^ (iXij - ^) - \ nkTju^k ^ 1. 

3.2.2 Optimization of ^ 

So far, we have seen how the lower bound Ck{,T 0 of ^ogp{Y\MK) could be maximized 
with respect to the distribution g(Z, vr, a, /3, 7 , 7 ). However, we have not addressed yet how 
f could be estimated from the data. Given a distribution g(-), we propose to maximize 
Ck{,T 0 with respect to each variable fij in order to obtain the tightest bound Ck{t 0 of 


logp(H|Mii-). This follows the work of Bishop and Svensen (2003) on Bayesian hierarchical 


mixture of experts and Latouche et ah (2011, 2014) on the overlapping stochastic block 


model. As shown in the following proposition, this leads to new estimates fij of fij. 
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Proposition 8 The estimate of ^ij maximizing Cxiq', 0 given by 


^ij — 


K 


K 


\ Y1 [«fez] + 2 TikTji{m^)kixlmg + TliXijxJjiSg + mgm})). 

\ k,l k,l 


Note that fij = ^ji,Wi ^ j since the networks considered are undirected. 


This gives rise to a three steps optimization scheme. Given a matrix the variational 
E and M steps of the VBEM algorithm are used to maximize Cxig] 0 with respect to 
g(-). This distribution is then held hxed and the bound is maximized with respect to f. 
These three steps are repeated until convergence of the lower bound. The proof is given in 
Appendix A.8. 


3.3 Estimation 

Goodness-of-fit For any K, we have seen how variational techniques could be used to ap¬ 
proximate the marginal log-likelihood logp(F|M/f) using a lower bound Ck ■= maxg^g 
As exposed in Section M our goodness-of-£t procedure relies on the posterior probability 
of K, that is p{Mk\Y). Indeed, this posterior distribution cannot be derived in a exact 


manner but, as shown in Volant et al.| (2012), the distribution p{Mk\Y) that minimizes 


the Kullback-Leibler divergence with p{Mk\Y) satisfies 


p{Mk\Y) (xp{MK)exp{CK}- 


The approximate posterior probability of Ho is then p{Ho\Y) = p{Mi\Y) and the corre¬ 
sponding approximate posterior Bayes factor i?oi, dehned in (|^, can be computed in the 
same manner. 

The following proposition, which is proved in Appendix A.9, shows that many terms of 
Cxig] 0 vanish, when computed after a specihc optimization step, so that the lower bound 
takes a simpler form. 
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Proposition 9 If computed right after the variational M step, the lower bound is given by 


CKiq-, « = ^ - f + j + '“g ^ ^ 

b d 

+ aologbo + a„(l - ^ - log6„) + Cologc^o + c„(l - ^ - ^ogd^) 

. K n K K 

+ 9 + 9 log I>5/31 - log Tik + - - -mlS^^rUg 


T{ar 


r(c„) 


k<l 


i=l k=l 


k<l 


1 ” 1 
+ 0^1 


i¥=j 


where C{x) = r(^fc) / T ( X]fc=i r(-) is the gamma function. 


yK 


Residual structures While the main object of this work is to provide tools to assess 
the goodness of £t of a logistic regression model for networks, the considered variational 
algorithm also provides a natural way to estimate the residual structure 0. We recall that, 
under Model Hq, i.e. the network is completely explained by the covariates, the function 
0 is constant. 

Still, under the alternative Model Hi, a residual structure remains, that is encoded 
in 0. As a consequence, an estimate of this function can be useful to investigate the 
residual structure, similarly to the residual plot classically used in a regression context. 
Removing the covariate effect, recall that Mk is a SBM model. Therefore, an approximate 
posterior mean can be derived, relying on the VBEM model averaging approach considered 
Latouche and Robin (2015) for SBM. Proposition [Io| provides the approximate posterior 


m 


mean of the function 0, that we propose as the network counterpart of the residual plot in 
regression. Note that it results from an integration over all model parameters and Models 
Mk. 


Propositiou 10 From Proposition 1 in Latouche and Robin (2015), for{u,v) G [0,1]^, u < 
V, the approximate posterior mean of the residual strueture 0 is 

E[0(u,u)|y] = Y.p{Mk\Y)^4>{u,v)\Y,Mk], 

K>1 
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where 


E[(j){u,v)\Y,MK] = [Fk-i,i-i{u,v,e) - Fk,i-i{u,v, e) - Fk-i,i{u,v, e) + Fk,i{u,v,e)]. 

k<l 

Fk^i{u,v;e) denotes the joint cdf of the Dirichlet variables (cTfc,cr/) such that cr^ = 
and 71 has a Dirichlet distribution Dir{e). 


As mentioned in Section |2.1[ the residual structure (j) is related to the graphon function of 
hh-graph models, which suffer from identihability issues. Indeed, for any measure preserv¬ 
ing transformation a of [0,1] to [0,1], the function (f)„{u,v) = cj) {a{u),a{v)) leads to the 
same model as with the function (j){u,v). To tackle this issue, the common approach is to 
assume that the mean function J u)du is increasing in u. This identihability constraint 
was applied when producing the residual structure plots presented in the following section. 


4 Simulation study 

In order to assess the proposed methodology, we carried out a series of experiments on 
simulated data hrst and then on real data. In this section, we focus on the estimation of 
the posterior probability p{Ho\Y). We aim at evaluating the capacity of the approach to 
detect Hi using toy data. Similar results were obtained for the estimated Bayes factors 
Bqi and identical conclusions were drawn. 

4.1 Simulation design 

We simulated networks using Model Hi. Thus, each node is hrst associated to a latent 
position Ui sampled from a uniform distribution over the (0,1) interval. Then, a vector 
of covariates Xi G is drawn for each node, using a standardized Gaussian distribution, 
i.e. with zero mean and covariance matrix set to the identity matrix, with d = 2. In 
order to construct the covariate vector Xij G for each edge (i,j) with {i < j), we hxed 
Xij = Xi — Xj. For the function 0(-, •)> considered a design inspired by the one proposed 

. In this work, the graphon function is W{u, v) = pX‘^{uv)^~^ 
where the parameter p > 0 controls the graph density and A > 0 the degree concentration. 


m 


Latouche and Robin (2015 
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For more details, we refer to Latouche and Robin (2015). Note that the maximnm of 
the graphon fnnction is p\^ so A < 1/y/p mnst hold since hF(-, •) is a probability. In onr 
case, the probabilities for nodes to connect are given throngh a logistic fnnction g{-) and 
therefore we set (j){u,v) = g~^[p\‘^{uv)^~^). For A = 1, the fnnction 0(-, •) is constant and 
so the networks are actnally sampled from Model Hq. Conversely, for all A > 1, data sets 
come from Model Hi. As A increases, the residnal strnctnre, not acconnted for by Model 
Hq, becomes sharper and thns easier to detect. 

We considered networks of size n = 100 and n = 150 as well as three valnes for the 
parameter p G {10“^, 10“^'®, 10“^} helping controlling the sparsity. Finally, we tested 20 
different valnes of A in [1,5]. For each of the triplets (n, p. A), we simnlated 100 networks 
and we applied the methodology we propose for valnes of K between 1 and 10. Becanse the 
variational algorithm depends on the initialization, as any EM like procednre, for each K 
it was rnn twice and the best run was selected, such that the lower bound was maximized. 
Note that equal prior probabilities were given for the Models Mx {K > 2) such that 
p{H[) = 1/2. Moreover, we set Oq = 5o = Cq = do = = 1- 


4.2 Results 

Estimation of p(Ho\Y). The results are presented in Figure It appears that for 
low values of A, the median (indicated in bold on the boxplots) of the estimated values of 
p{Hq\Y) is 1 and goes to 0, when A increases, as expected. The results for the scenario with 
the highest sparsity (p = 10“^) and n = 100 are unstable although the median values share 
this global property. Much stable results were obtained for larger networks. Interestingly, 
experiments can be distinguished in the way Model Hi is detected. As soon as A > 1, 
then the true model responsible for generating the data is Hi and so the probability of 
Model Hq should be lower than 1/2. In practice, the estimated probability p{Ho\Y) is 
lower than 1/2 for slightly larger values of A. For instance, for p = 10“^'® and n = 150, 
p{Hq\Y) k. 0 for A = 1.8. For p = 10“^ and n = 100 the detection threshold appears sooner, 
for A = 1.6. The experiments illustrate that Hi is detected more easily, as the network size 
n and (density) parameter p increase. Overall the results are encouraging with particularly 
low detection threshold. For p = 10“^ and n = 150, Model Hi is always detected when 
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Figure 1: Boxplots of the estimated values 'p{Hq\Y) of the posterior probability p{Ho\Y), 
obtained with the variational approximations, for values of A ranging from 1 to 5. Six 
scenarios considered with the number n of nodes in {100,150} and the sparsity parameter 
p in (10“^, 10“^'^, 10“^}. Model Hq is true for A = 1 and false for A > 1. 

n = 100 nodes n = 150 nodes 

■°Dmi. 


1—I—I—I— I— I — I — I — I —I —I—I—I—I—I—I—I—I—I—r 

1 1.4 1.8 2.3 2.7 3.1 3.5 3.9 4.4 4.8 
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1 1.4 1.8 2.3 2.7 3.1 3.5 3.9 4.4 4.8 


present as soon as A > 1.2. 
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Computational cost. To give some insight into the compntational cost of the proposed 
methodology, we recorded the rnnning time for the estimation of p{Hq\Y)^ in various con¬ 
ditions. Note that the inference strategy can easily been parallelized. Therefore, to give a 
fair evaluation, we applied the methodology once for each network generated, on a unique 
core. The results presented in Table [T] were obtained on an Intel Xeon CPU 3.07GHz, 
for A = 2 and p = 10“^. As expected, the running time increases as the network size 
n becomes higher. Similarly, increasing the number d of covariates induces an additional 
computational effort. Again, the methodology proposed involves testing various values of 
K (from 1 to 10 in these experiments) which can be done in parallel to reduce signihcantly 
the running times. If a core is used for each value of K, then the running time is given 
essentially by the slowest run, usually for the largest value of K. For information, the 
corresponding running times are also indicated in parenthesis in Table [T] 


size of the network [n] 

d = 2 

d = 5 

d=10 

100 

0.47 (0.1) 

0.6 (0.12) 

0.72 (0.14) 

250 

3.42 (0.73) 

4.74 (0.88) 

5.97 (1.26) 

500 

18.03 (3.73) 

20.28 (4.17) 

24.43 (4.91) 


Table 1: Averaged running times (in minutes) for the estimation of p{Hq\Y), for various 
sizes n of networks and various values of d. In parenthesis, the averaged running times (in 
minutes) for 77 = 10. 

5 Illustrations 

We applied our approach to analyze a series of networks of various sizes and densities, from 
social sciences and ecology. For all studies, equal prior probabilities were given for the 
Models Mk {K > 2) such that p{H[) = 1/2. Moreover, we set ao = bo = cq = do = eo = 1. 
The variational algorithm was run on each network for K between 1 and 16. For each K, 
the procedure was repeated 20 times and the run maximizing the lower bound was selected. 

Coding of the covariates. The model we propose involves a regression term where 
Xij is a vector of covariates for edge (f, j). In some situations, edge descriptors Xij, such as 
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(phylogenetic, geographic) distances, are actually available. But in many situations, only 
node descriptors Xi and Xj are available and building an edge descriptor Xij from node 


descriptors is not a straightforward task (see e.g. 

Hunter et al. 

200^ 

)). For all networks 

(except the blog network to be consistent with 

Latouche and Robin 

2015p), we adopted 


the following coding rules. Quantitative edge descriptors were treated as quantitative 
regressors. For quantitative node descriptors, the absolute difference Xij = \xi — Xj\ was 
used as a quantitative covariate. For ordinal node descriptors Xj G {1,... L}, we considered 
the absolute difference |xj — Xj\ but we treated it as a factor, with L — 1 levels. Qualitative 
node descriptors with L levels were transformed into qualitative edge descriptors with 2L 
levels, each node level ^ giving rise to two edge levels: one indicating if both i and j have 
level £ and one indicating if either i or j (but not both) has level 


5.1 Description of the datasets 

Blog network. The network is made of 196 vertices and was built from a single day 


snapshot of political blogs extracted on 14th October 2006 (Zanghi et ah, 2008). Nodes 


correspond to blogs and an edge connect two nodes if there is an hyperlink from one blog to 
the other. They were annotated manually by the “Observatoire Presidentiel” project such 
that, for each node, labels are available. Thus, each node is associated to a political party 
from the left wing to the right wing and the status of the writer is also given (political 
analyst or not). This data set has been studied in a series of works (Zanghi et ah, 2008| 


Latouche et ah, 2011, 2014) where all the authors pointed out the crucial role of the 


labels in the construction of the network. We considered a set of three covariates xtj = 
(xlj, x^j, xfj) G artificially constructed to analyze the influence of both the political 
parties and the writer status. We set xh = 1 if blogs i and j have the same labels, 0 
otherwise. Moreover, x'fj = 1 if one of the two blogs i and j is written by political analysts, 
0 otherwise. Finally, xfj = 1 if both are written by political analysts, 0 otherwise. 


Tree network. This data set was hrst introduced by Vacher et ah (2008) and further 


studied in Mariadassou et ah (2010). We considered the tree network which describes the 


interactions between 51 trees where two trees interact if they share at least one common 
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fungal parasite. Three quantitative edge descriptors are available characterizing the genetic, 
geographic, and taxonomic distances between the tree species. 


Karate network. The karate data set describes the friendships between a subset of 
34 members of a karate club at a university in the US, observed from 1970 to 1972. It 


was originally studied by Zachary (1977). When the study started, an incident occurred 


between the club president and a karate instructor, over the price of the karate lessons. 
The entire club then became divided over this issue, as time passed. The network is made 
of four known groups characterized by a node qualitative descriptor, taking four possible 
values, for each node in the network. 


Florentine marriage network. We considered the data set analyzed by Breiger and 


Pattison (1981) in their study of local role analysis in social networks. It characterizes 


the social relations among 16 Renaissance Florentine families and was built by John Pad¬ 
gett from historical documents. Two nodes are linked is the two families share marriage 
alliances. Three quantitative node covariates are provided for each family, namely the 
family’s net wealth in 1472 in thousands of lira, the family’s number of seats on the civic 
councils held between 1282 and 1344, and the family’s total number of business and mar¬ 
riage ties in the entire data set. 

Florentine business network. This data set is similar to the Florentine marriage net¬ 
work described previously except that edges now describe business ties between families. 
We considered exactly the same covariates. 


Faux Dixon High network. Contrary to all networks presented in this work, this data 
set is directed and therefore we employed the inference algorithm for the directed case, 
as presented in the supplementary materials. This network characterizes the (directed) 
friendship between 248 students. It results from a simulation based upon an exponential 


random graph model fit (Handcock et ah, 2008) to data from one school community from 


the AddHealth Study, Wave I (Resnick et ah, 1997). Node covariates are provided, namely 
the grade, sex, and race of each student. The grade ordinal attribute has values 7 to 12, 
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indicating each student’s grade in school. Moreover, the race qualitative attributes can 
take 4 values. 


CKM. This data set was created by Burt (1987) from the data originally collected by 


Coleman et ah (1966). The network we considered characterizes the friendship relationships 


among physicians, each physician being asked to name three friends. The physicians were 
also asked to answer to a series of questions regarding their profession. We focused here on 
13 questions corresponding to node covariates among which four are qualitative descriptors: 
city of practices (4 values), discussion with other doctors (3 values), speciality in a held of 
medicine (4 values), proximity with other physicians (4 values). All other node covariates 
were treated as quantitative variables. Note that we imputed the missing values in the 


data set using the missMDA R package (Josse and Husson, 2016). 


AddHealth 67. This data set is related to the Faux Dixon network described previously. 
However, it was constructed from the original data of the AddHealth study, and not simu¬ 
lated from any random graph model. The AddHealth study was conducted using in-school 
questionnaires, from 1994 to 1995. Students were asked to designate their friends and to 
answer to a series of questions. Results were collected in schools from 84 communities. In 
our study, we considered a network associated to school community 67 which characterizes 
the undirected friendship relationships between 530 students. As for the Faux Dixon net¬ 
work, three node covariates are available. The sex qualitative covariate takes two values. 
Moreover, the grade ordinal attribute has values from 7 to 12. However, contrary to the 
Faux Dixon network, hve values are present in the data for the race qualitative attribute. 


5.2 Results 


The estimated values of p{Hq\Y) for all networks are presented in TableFor illustration 
purposes, the estimations of the residual structures g o ^ are also provided in Figures 


and|^ In practice, we used Proposition 10 to estimate 0 and then applied g{-) to obtain 
graphon-like surfaces. There is no standard dehnition of kF-graph models in the directed 
case and therefore, for the Faux dixon high network, only the estimation of p{Hq\Y) is 
given. 
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Network 

size (n) 

nb. covariates (d) 

density 

p{Ho\Y) 

Blog 

196 

3 

0.075 

3.60e-172 

Tree 

51 

3 

0.54 

2.36e-115 

Karate 

34 

8 

0.14 

3.38e-2 

Florentine (marriage) 

16 

3 

0.17 

0.995 

Florentine (business) 

16 

3 

0.125 

0.991 

Faux Dixon High 

248 

17 

0.02 

1 

CKM 

219 

39 

0.015 

1 

AddHealth 67 

530 

21 

0.007 

2.10e-25 


Table 2: Estimation of p{Hq\Y)^ for the eight networks considered. 


As shown in Table Model Hq was rejected for the blog, tree, karate and AddHealth 
networks. Indeed, we obtained values of 'p{Hq\Y) close to zero for the four data sets, 
indicating that the corresponding covariates cannot explain entirely the construction of 
these networks. For the blog network, we can observe in Figure (top right) that g o ^ 
is not constant which is coherent with Model Hq being rejected. We also give in this 
figure (top left) the estimated residual structure without taking the covariates into account 
(d = 0). Clearly, the shape of 5 ^ o 0 is simpler when d = 3. In particular, many of the hills 
on the diagonal vanish when adding the covariates. Thus, the covariates help in studying 
and explaining parts of the network. However, they are not sufficient and some of the 
heterogeneity observed in the network cannot be explained by political parties and writer 
status. Similar conclusions can be drawn from the tree , karate, and AddHealth networks 
(Figure]^ and Figure]^. Indeed, the terms g o ^ simplify when adding the covariates but 
remain non constant. In particular, for the tree network considered, this means that the 
interactions between trees through common fungal parasite cannot be entirely explained 


by the distances available which is consistent with a these from Mariadassou et ah (2010) 
who describe a residual heterogeneity in the valued version of this network, after taking 
the covariates into account. 

For all other networks considered, model Hq was chosen. Indeed, for the Florentine 
marriage and business networks, we found P{Hq\Y) = 0.995 and p{Hq\Y) = 0.991 respec- 
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Figure 2: Estimation of the blog (top) and tree (bottom) networks residual structure 
without (left) and with (right) covariates. 



tively. As expected, the residual structures g o (p were found constant when adding the 
covariates (Figure]^. Moreover, the variational approach led to p{Ho\Y) = 1, for the Faux 
Dixon High and CKM networks. Thus, the statistical framework we propose shows that no 
other effect than these of the covariates contributes signihcantly to explain the structure of 
these networks. In other words, once corrected for the covariates, no residual heterogeneity 
is observed among the interactions. 


6 Conclusion 

In this paper we proposed a framework to assess the goodness of £t of logistic models for bi¬ 
nary networks. Thus, we added a generic term, related to the graphon function of IF-graph 
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Figure 3: Estimation of the karate (top) and AddHealth (bottom) networks residual 
structure without (left) and with (right) covariates. 
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Figure 4: Estimation of the Florentine marriage (top), Florentine business (middle), and 
CKM networks residual structure without (left) and with (right) covariates. 























models, to the logistic regression model. The corresponding new model was approximated 
with a series of models with blockwise constant residual structure. A Bayesian procedure 
was then considered to derive goodness-of-£t criteria. All these criteria depend on marginal 
likelihood terms for which we did provide estimates relying on variational approximations. 
The hrst approximation was obtained using a variational decomposition while the second 
involves a series of Taylor expansions. The approach was tested on toy data sets and en¬ 
couraging results were obtained. Finally, it was used to analyze eight networks from social 
sciences and ecology. We believe the methodology has a large spectrum of applications 
since covariates are often given when analyzing binary networks. 
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SUPPLEMENTARY MATERIAL 


Appendix: Give all proofs of the paper. (Appendix.pdf) 

Directed case: Describe the inference procedure for the directed case. (Directed.pdf) 
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